A Fault-Tolerant Mobile Agent for a Computer Network

Field and Background of the Invention 

The invention relates to a method of operating a mobile agent that travels through a network of a 
number of computers. 

Such a mobile agent system is known, e.g., from A. Mohindra, A. Purakayastha and P. Thati: "Exploiting non-determinism for reliability of mobile agent systems", in Proc. of the Int. Conf. on Dependable Systems and Networks, pages 144-153, New York, June 2000.

One concern with such a mobile agent system is that failures may lead to blocking or to a complete loss of the mobile agent. This problem may be solved by replicating the mobile agent. However, replication raises the so-called exactly-once execution problem: the agent's operations must take effect exactly once. In the prior art document mentioned above, this problem is solved by detecting multiple executions of the mobile agent at the end of the execution and by undoing all effects of the duplicate executions. However, such an undo function is not simple and often limits the overall system throughput.

Summary of the Invention

It is an object of the invention to provide a method of operating a mobile agent which is 
fault-tolerant without being too complex. 

This object is solved by one aspect of the present invention, which provides a method of operating a mobile agent that travels through a network of a number of computers, wherein the mobile agent is executed in a sequence of stages and wherein each stage comprises a set of places, the method comprising the following steps: executing the mobile agent in at least one of the set of places of a respective one of the stages, evaluating in which place of the respective stage the mobile agent has been executed successfully, agreeing on this place among the set of places, aborting and/or undoing any operation in connection with the mobile agent in any other place of the respective stage, and moving the modified mobile agent resulting from the successful execution to the next stage.

This object is also solved by a computer program product that contains instructions implementing the steps of the foregoing method and, still further, by a method in which the foregoing steps are managed by a fault-tolerance enabler (FTE) that is independent of the mobile agent.

The invention replicates the mobile agent so that, within each stage of the sequence of stages in which the mobile agent is executed, a set of places is available. In order to prevent blocking and to solve the exactly-once execution problem, the invention models the execution of the mobile agent and its replication as a sequence of agreement problems.

According to the invention, the mobile agent is executed in at least one of the set of places of a respective one of the stages. Then, it is evaluated in which place of the respective stage the mobile agent has been executed successfully, and this place is agreed upon among the set of places. After this step, any operation in connection with the mobile agent in any other place of the respective stage is aborted and/or undone. Finally, the modified mobile agent resulting from the successful execution is moved to the next stage.
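The four steps can be illustrated with a small single-process sketch. It assumes, purely for illustration, that each place is a Python callable that either returns the modified agent state or raises a hypothetical `PlaceFailure` exception; `run_stage`, `ok_place` and `crashing_place` are likewise illustrative names, not part of the described method, and the agreement step is trivial here because only one process takes part.

```python
from typing import Callable, Dict, List, Tuple

Agent = Dict[str, int]                 # hypothetical agent state
Executor = Callable[[Agent], Agent]    # how a place runs the agent


class PlaceFailure(Exception):
    """Hypothetical signal that a place, or the agent running there, failed."""


def run_stage(agent: Agent,
              places: List[Tuple[str, Executor]]) -> Tuple[str, Agent, List[str]]:
    """One stage S_i: execute, evaluate, agree, abort/undo, hand on a_{i+1}."""
    tried: List[str] = []
    for name, execute in places:           # places of M_i in a fixed order
        tried.append(name)
        try:
            result = execute(dict(agent))  # run a replica on a copy of a_i
        except PlaceFailure:
            continue                       # failure: the next place takes over
        # Evaluate and agree: with a single process, the first success is
        # trivially the agreed place.
        to_undo = [n for n in tried if n != name]  # attempts to abort/undo
        return name, result, to_undo       # `result` moves on to S_{i+1}
    raise RuntimeError("no place of the stage executed the agent")


def ok_place(agent: Agent) -> Agent:
    agent["hops"] = agent.get("hops", 0) + 1
    return agent


def crashing_place(agent: Agent) -> Agent:
    raise PlaceFailure("place crashed")
```

With the first place crashing, the sketch reports the second place as the agreed place, returns the agent state after one more hop, and lists the crashed place as the one whose partial work must be aborted or undone.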

This method ensures that exactly one execution of the mobile agent within the set of places of the respective stage is committed, whereas all other possible executions are aborted and/or undone.

The inventive method is preferably implemented by a so-called fault-tolerance enabler (FTE), which may be programmed as an independent component but may then travel to the places of the stages together with the mobile agent.

Further advantages and embodiments of the invention are apparent from the further claims and/or 
from the following description of the drawings. 

Brief Description of the Drawings 



Examples of the invention are depicted in the drawings and are described in detail below by way of example. The drawings show:

Figure 1a: a schematic representation of a method of operating a mobile agent according to an embodiment of the invention;

Figure 1b: a schematic representation of the method of figure 1a in the presence of a failure;

Figure 2: a schematic block diagram of a consensus method according to an embodiment of the invention; and

Figure 3: a schematic block diagram of an architecture of the mobile agent according to an embodiment of the invention.

For the sake of clarity, the figures are not drawn to real dimensions, nor are the relations between the dimensions shown to a realistic scale.

Detailed Description of Embodiments of the Invention

In the following, various exemplary embodiments of the invention are described.

A mobile agent is a computer program that acts autonomously on behalf of an agent owner or user and that travels through a network of a number of computers. Failures in such a system may lead to a blocking of the execution of the mobile agent or to a partial or complete loss of the mobile agent. Moreover, the agent owner often does not know whether the mobile agent has actually been lost due to a failure or whether its execution has only been delayed by slow computers. The agent owner may then believe that the mobile agent has been lost when in fact it has not, or may keep waiting for the mobile agent to finish when it has in fact failed.

This uncertainty may be removed by a mobile agent with fault-tolerant execution. The mobile agent then either reaches its destination or at least reports that a problem has occurred.
Such fault tolerance may be achieved by replicating the mobile agent. Replicating the mobile agent amounts to adding redundancy and enables the mobile agent to continue its execution despite failures. Blocking of the mobile agent is therefore prevented.

However, replicating the mobile agent may violate the so-called exactly-once execution property of the mobile agent's execution. If, for example, a mobile agent executing on a first computer fails, the first computer may nevertheless survive and retain the modifications made by the failed mobile agent. A replica of the mobile agent is then executed on a second computer and modifies that computer as well. The result is modifications on both the first and the second computer, which violates the exactly-once execution property. This property is also violated if a failure of the mobile agent is detected although the mobile agent has not actually failed. In this case, the unreliable failure detection leads to a double execution of the mobile agent which, as mentioned, violates the exactly-once execution property.
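The double-execution hazard can be made concrete with a tiny simulation. It assumes, purely for illustration, that each computer keeps a hypothetical ledger of the side effects an agent performs and that an unreliable failure detector may suspect a slow agent; the function names are illustrative and no agreement step is performed.

```python
from typing import Dict, List


def execute(agent_id: str, computer: str, ledgers: Dict[str, List[str]]) -> None:
    """Apply the agent's side effect (e.g. a purchase) on `computer`."""
    ledgers.setdefault(computer, []).append(agent_id)


def naive_replication(suspected: bool) -> Dict[str, List[str]]:
    """Run the agent on computer A; if A is suspected, run a replica on B.
    Without agreement, nothing done on A is aborted or undone."""
    ledgers: Dict[str, List[str]] = {}
    execute("agent-1", "A", ledgers)      # original execution: slow, not failed
    if suspected:                         # unreliable failure detection
        execute("agent-1", "B", ledgers)  # the replica executes as well
    return ledgers


def total_executions(ledgers: Dict[str, List[str]]) -> int:
    """Number of committed executions across all computers."""
    return sum(len(effects) for effects in ledgers.values())
```

A false suspicion yields two committed executions instead of one, which is the violation described above.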

The idea is to model the execution of the mobile agent and its replication as a sequence of agreement problems. For that purpose, the following assumptions are made; they are explained now in connection with figure 1a.

As already described, a mobile agent $a$ executes on a sequence of computers. A place $p_i$, where $i = 0, \dots, n$, provides a logical execution environment for the mobile agent; each computer may host multiple places. The execution of the mobile agent $a_i$ at a place $p_i$ is called a stage $S_i$. The replicas of the mobile agent $a_i$ execute on different places $p_i^j$ within one and the same stage $S_i$. Two stages $S_i$ and $S_{i+1}$ are separated by a move operation of the mobile agent. The places where the first and the last execution of the mobile agent take place are called the source $p_0^0$ and the destination $p_n^0$ of the mobile agent; they may be identical.
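The notation above maps onto a few simple data types. The following is one possible encoding, offered as a sketch: the string form `p<i>^<j>` for $p_i^j$, the helper `place`, the classes `Stage` and `Decision` (standing for $S_i$ and $\langle a_{i+1}, M_{i+1} \rangle_{p_i^j}$), and the four places assumed for $S_3$ (the text does not state its size) are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, List


def place(i: int, j: int) -> str:
    """Name of place p_i^j: the j-th replica place of stage S_i."""
    return f"p{i}^{j}"


@dataclass(frozen=True)
class Stage:
    index: int     # i in S_i
    replicas: int  # number of places in M_i

    @property
    def places(self) -> List[str]:
        """The set M_i, ordered by replica index j."""
        return [place(self.index, j) for j in range(self.replicas)]


@dataclass(frozen=True)
class Decision:
    """<a_{i+1}, M_{i+1}>_{p_i^j}: result agent, next places, deciding place."""
    agent: Any
    next_places: List[str]
    decided_by: str


# Figure 1a itinerary: single-place source S_0 and destination S_4;
# S_1 and S_2 have four places each, S_3 is assumed to have four as well.
ITINERARY = [Stage(0, 1), Stage(1, 4), Stage(2, 4), Stage(3, 4), Stage(4, 1)]
```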

According to figure 1a, the mobile agent $a_0$ is executed in the place $p_0^0$ of stage $S_0$, which is the source of the mobile agent. Then, after the mobile agent $a_0$ has been executed successfully, the agreement problem is solved by a decision $\langle a_1, M_1 \rangle_{p_0^0}$, in which $a_1$ is the resulting mobile agent after executing the mobile agent $a_0$ at the place $p_0^0$ of stage $S_0$, $M_1$ is the set of places $p_1^j$ of the next stage $S_1$, and $p_0^0$ is the place of stage $S_0$ that has successfully executed the mobile agent $a_0$. The evaluation of this decision is explained later.

Due to this decision, the mobile agent $a_1$ enters the next stage $S_1$ at the place $p_1^0$ and is executed there. According to figure 1a, stage $S_1$ comprises the further places $p_1^1$, $p_1^2$ and $p_1^3$, in which replicas of the mobile agent $a_1$ may be executed. However, once the mobile agent $a_1$ has been executed successfully at place $p_1^0$ of stage $S_1$, the agreement problem is solved at once, i.e. the set $M_1$ of places $p_1^0$, $p_1^1$, $p_1^2$ and $p_1^3$ agrees that place $p_1^0$ has executed the mobile agent $a_1$ successfully. This leads to a decision $\langle a_2, M_2 \rangle_{p_1^0}$, in which $a_2$ is the resulting mobile agent after executing the mobile agent $a_1$ at stage $S_1$, $M_2$ is the set of places of the next stage $S_2$, and $p_1^0$ is the place of stage $S_1$ that has successfully executed the mobile agent $a_1$.

According to figure 1a, this procedure is continued through the sequence of stages $S_i$ until the destination of the mobile agent is reached. There, the mobile agent $a_4$ enters stage $S_4$ and is executed in the only place $p_4^0$.

In figure 1a, no failure occurs: none of the computers fails, none of the places fails, and none of the executions of the mobile agent fails. Moreover, no failure is detected incorrectly. Therefore, the mobile agent is always executed in the first place of each stage that comprises more than one place, i.e. in the places $p_1^0$, $p_2^0$ and $p_3^0$ of stages $S_1$, $S_2$ and $S_3$. Consequently, these places $p_1^0$, $p_2^0$ and $p_3^0$ are also the places named in the respective decisions after the execution of the mobile agent in the respective stages.

In contrast thereto, figure 1b comprises a failure of the place $p_2^0$ of stage $S_2$. This is depicted in figure 1b with the label "crash".

When the place $p_2^1$ detects the failure of the place $p_2^0$, it executes a replica of the mobile agent $a_2$. Note that the place $p_2^0$ is the first place in the sequence of the set $M_2$ of places $p_2^0$, $p_2^1$, $p_2^2$ and $p_2^3$ of stage $S_2$ to execute the mobile agent $a_2$. The next place $p_2^1$ is able to monitor the execution of the mobile agent $a_2$ in the preceding place $p_2^0$. Upon detection of a failure of the mobile agent $a_2$ or of the place $p_2^0$, the next place $p_2^1$ starts executing the replica of the mobile agent $a_2$.
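The text leaves open how the next place monitors its predecessor. One common option, assumed here only for illustration, is a heartbeat timeout; `Monitor` and `should_take_over` are hypothetical names. Because a timeout cannot distinguish a crashed place from a slow one, such a detector is exactly the unreliable failure detection discussed earlier, which is why a takeover must still be followed by agreement.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Hypothetical timeout detector run by p_i^{j+1} to watch p_i^j.
    It can only suspect a failure; it cannot know one."""
    timeout: float
    last_heartbeat: float = 0.0

    def heartbeat(self, now: float) -> None:
        """Record a sign of life from the predecessor at time `now`."""
        self.last_heartbeat = now

    def suspects(self, now: float) -> bool:
        """True if the predecessor has been silent longer than `timeout`."""
        return now - self.last_heartbeat > self.timeout


def should_take_over(monitor: Monitor, now: float, stage_decided: bool) -> bool:
    """Execute a replica only if the predecessor is suspected and the
    stage has not been decided yet."""
    return monitor.suspects(now) and not stage_decided
```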

After the replica of the mobile agent $a_2$ has been executed successfully in the place $p_2^1$ of stage $S_2$, the agreement problem is solved: the set $M_2$ of places $p_2^0$, $p_2^1$, $p_2^2$ and $p_2^3$ agrees on the place in which the mobile agent has been executed successfully. As described, this is the place $p_2^1$. This leads to a decision $\langle a_3, M_3 \rangle_{p_2^1}$, in which $a_3$ is the resulting mobile agent after executing the mobile agent $a_2$ at stage $S_2$, $M_3$ is the set of places $p_3^j$ of the next stage $S_3$, and $p_2^1$ is the place of stage $S_2$ that has successfully executed the mobile agent $a_2$.

The important difference between figure 1a and figure 1b is therefore that the decision after stage $S_2$ of figure 1b names the place $p_2^1$ as having successfully executed the mobile agent $a_2$, whereas the decision after stage $S_2$ of figure 1a names the place $p_2^0$. The decision of figure 1b thus reflects the fact that the execution of the mobile agent $a_2$ failed in the place $p_2^0$ of stage $S_2$.
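The contrast between the two figures can be reproduced with a short simulation. It is a sketch under stated assumptions: places of a stage are tried in the order $p_i^0, p_i^1, \dots$, a place listed in `crashed` fails and its successor takes over, and the stage sizes are those of figure 1a with four places assumed for $S_3$. The function name `run_itinerary` is hypothetical.

```python
from typing import List, Set, Tuple


def run_itinerary(sizes: List[int], crashed: Set[str]) -> List[Tuple[int, str]]:
    """Return, for each stage S_i, the place that executed the agent
    successfully. sizes[i] is the number of places of stage S_i."""
    primaries: List[Tuple[int, str]] = []
    for i, size in enumerate(sizes):
        for j in range(size):
            name = f"p{i}^{j}"
            if name not in crashed:          # this place succeeds
                primaries.append((i, name))
                break
        else:                                # no place of S_i succeeded
            raise RuntimeError(f"every place of stage S_{i} failed")
    return primaries


SIZES = [1, 4, 4, 4, 1]                      # S_3 size assumed
fig_1a = run_itinerary(SIZES, crashed=set())
fig_1b = run_itinerary(SIZES, crashed={"p2^0"})
```

Only the entry for $S_2$ differs between the two runs: it is `p2^0` for figure 1a and `p2^1` for figure 1b.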

The decisions taken in each of the stages $S_i$ of figures 1a and 1b are evaluated by using a consensus method, which is explained now in connection with figure 2.

Figure 2 shows a stage $S_i$, which may be any of the stages shown in figures 1a and 1b. The stage $S_i$ comprises the corresponding mobile agent $a_i$ and a so-called fault-tolerance enabler (FTE) as two independent components.

When the stage $S_i$ is entered from a preceding stage, the FTE starts to solve the agreement problem for this stage $S_i$ (see block 20). For that purpose, block 20 initiates (see arrow 21) the operation of stage $S_i$ (see block 22), so that the mobile agent $a_i$ is executed sequentially in the places $p_i^j$ of stage $S_i$. As soon as one of the places $p_i^j$ successfully executes the mobile agent $a_i$, this is recognized by block 20 of the FTE (see arrow 23). This successful place is agreed upon among the set $M_i$ of places and is then called the primary place $p_i^{prim}$.

Block 20 of the FTE then confirms to all places $p_i^j$ of stage $S_i$ that the primary place $p_i^{prim}$ is committed and that all other places have to abort and/or undo any operation in connection with the mobile agent $a_i$.

At every place except the primary place $p_i^{prim}$, any operation in connection with the mobile agent $a_i$ is then aborted and/or undone (see block 24 and block 25). As soon as this phase is finished, this is recognized by the FTE (see arrow 26).
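The text does not prescribe a particular consensus algorithm. As a stand-in, the sketch below models the agreement among $M_i$ with a hypothetical write-once decision register: the first place to propose its result becomes $p_i^{prim}$, a later proposal (for example from a replica started after a false suspicion) simply learns the existing decision, and every other place is told to abort or undo. In a real distributed system the register would itself have to be built on a consensus protocol; the process-local lock here is only a placeholder.

```python
import threading
from typing import Dict, List, Optional, Tuple


class StageAgreement:
    """Write-once decision for one stage S_i (stand-in for real consensus)."""

    def __init__(self, places: List[str]) -> None:
        self.places = places                      # the set M_i
        self._lock = threading.Lock()
        self._decision: Optional[Tuple[str, dict]] = None

    def propose(self, place: str, result_agent: dict) -> Tuple[str, dict]:
        """Propose `place` as primary with its resulting agent a_{i+1}.
        Returns the agreed (primary place, agent), possibly another's."""
        with self._lock:
            if self._decision is None:
                self._decision = (place, result_agent)
            return self._decision

    def outcome(self) -> Dict[str, str]:
        """Tell every place of M_i whether to commit or to abort/undo."""
        if self._decision is None:
            raise RuntimeError("stage not decided yet")
        primary = self._decision[0]
        return {p: "commit" if p == primary else "abort/undo" for p in self.places}
```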

The decision of the agreement problem of the current stage $S_i$ is then present in the FTE (see block 27). This decision has already been described above. The primary place $p_i^{prim}$ is identical with the places of figures 1a and 1b that have successfully executed the respective mobile agent $a_i$. In particular, with regard to figure 1b, the primary place $p_2^{prim}$ of stage $S_2$ is the successful place $p_2^1$ and not the failed place $p_2^0$.

Block 27 of the FTE then moves the resulting mobile agent $a_{i+1}$ together with the generated decision, in particular together with the set $M_{i+1}$ of places $p_{i+1}^j$ of the next stage $S_{i+1}$, to this next stage $S_{i+1}$ (see arrow 28). This move of the resulting mobile agent $a_{i+1}$ is performed as a reliable forwarding function.

For that purpose, each place $p_i^j$ of stage $S_i$ sends a clone of the resulting mobile agent $a_{i+1}$ to all places $p_{i+1}^j$ of stage $S_{i+1}$. In order to reduce communication overhead, it is possible that only the primary place $p_i^{prim}$ of stage $S_i$ sends the resulting mobile agent $a_{i+1}$ to all places $p_{i+1}^j$ of stage $S_{i+1}$, and that all other places of stage $S_i$ only verify whether the resulting mobile agent $a_{i+1}$ has arrived at the places $p_{i+1}^j$ of stage $S_{i+1}$, e.g. by accessing the corresponding value in a repository of these places $p_{i+1}^j$.
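The two forwarding variants can be compared by counting messages. This is a simplified sketch: each place's repository is modelled as a plain dictionary, the key `next_agent` is a hypothetical name, a crashed primary is simulated by `primary_alive=False`, and the verification reads by the other places are not counted as messages.

```python
from typing import Dict, List

KEY = "next_agent"  # hypothetical repository key under which a_{i+1} is stored


def forward_all(places: List[str], agent: dict, next_places: List[str],
                repos: Dict[str, Dict[str, dict]]) -> int:
    """Every place of S_i sends a clone of a_{i+1} to every place of S_{i+1}.
    Duplicate deliveries overwrite the same entry. Returns messages sent."""
    sent = 0
    for _sender in places:
        for q in next_places:
            repos[q][KEY] = dict(agent)
            sent += 1
    return sent


def forward_primary(primary: str, primary_alive: bool, places: List[str],
                    agent: dict, next_places: List[str],
                    repos: Dict[str, Dict[str, dict]]) -> int:
    """Only the primary sends; the other places of S_i check the repositories
    of S_{i+1} and resend only where a_{i+1} is missing."""
    sent = 0
    if primary_alive:
        for q in next_places:
            repos[q][KEY] = dict(agent)
            sent += 1
    for _backup in (p for p in places if p != primary):
        for q in next_places:
            if KEY not in repos[q]:  # verification found a gap
                repos[q][KEY] = dict(agent)
                sent += 1
    return sent
```

With four places in each of $S_2$ and $S_3$, the first variant sends 16 messages and the second sends 4, even when the primary crashes before sending, because the first backup fills the gap and the remaining backups then find the agent present.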

As shown in figure 2, block 20 of the FTE then starts to solve the agreement problem for this next stage $S_{i+1}$.

The described consensus method is implemented with a so-called agent-dependent architecture. As shown in figure 3, the FTE is integrated into the mobile agent $a_i$ and travels with it to the sequential places $p_i^j$. Only one instance of the FTE exists per mobile agent; it is initialized by the user-defined agent 30 at the source of the mobile agent.
The FTE is composed of a stage agreement component 31, a reliable forwarding component 32 and a recovery component 33. The stage agreement component 31 performs the consensus method, the reliable forwarding component 32 is responsible for reliably forwarding the resulting mobile agent $a_{i+1}$ to the next stage, and the recovery component 33 handles any necessary recovery in case the mobile agent $a_i$ fails or arrives too late at one of the places $p_i^j$.
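The division of the FTE into three components can be sketched as a single object carried with the agent. All names are hypothetical; the stage agreement is reduced to "first proposal wins" for a single process, forwarding returns one clone per next place instead of sending it, and the recovery check uses an assumed maximum transfer time (`deadline`) to decide whether the agent arrived too late.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FTE:
    """Sketch of a fault-tolerance enabler that travels with one agent."""
    deadline: float                                   # assumed max transfer time
    decisions: Dict[int, str] = field(default_factory=dict)

    def agree(self, stage: int, place: str) -> str:
        """Stage agreement (31): the first proposal for a stage is kept."""
        return self.decisions.setdefault(stage, place)

    def forward(self, stage: int, agent: dict,
                next_places: List[str]) -> Dict[str, dict]:
        """Reliable forwarding (32): one clone of a_{i+1} per next place."""
        if stage not in self.decisions:
            raise RuntimeError("stage must be decided before forwarding")
        return {q: dict(agent) for q in next_places}

    def recover(self, started: float, arrived_at: Optional[float]) -> bool:
        """Recovery (33): needed if the agent never arrived or arrived late."""
        return arrived_at is None or arrived_at - started > self.deadline
```

Consistent with the single FTE instance per agent described above, the sketch keeps the decisions of all stages in one object that would be created once at the source.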

The FTE provides an FTE-specific application programming interface 34 for communication with the user-defined agent 30. The respective place $p_i^j$ provides a repository 35 and further services 36. The repository 35 is a location where place-specific information may be stored temporarily. For example, the decision generated by the FTE, in particular the primary place $p_i^{prim}$, may be stored in the repository 35. This information can then be kept until all other places of the respective stage $S_i$ are aware of this decision, and may be discarded after a certain time.
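The retention rule for the repository can be sketched as follows. The text only says that the decision is kept until all places of the stage are aware of it and is discarded after a certain time; the acknowledgement mechanism, the `retention` period and the injectable `clock` used here are illustrative assumptions, as are the names `Entry` and `Repository`.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class Entry:
    primary: str                           # p_i^prim of the decided stage
    places: Set[str]                       # M_i: places that must learn it
    acked: Set[str] = field(default_factory=set)
    all_aware_at: Optional[float] = None   # when the last place acknowledged


class Repository:
    """Hypothetical per-place store for stage decisions (repository 35)."""

    def __init__(self, retention: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.retention = retention
        self.clock = clock
        self.entries: Dict[int, Entry] = {}

    def store(self, stage: int, primary: str, places: Set[str]) -> None:
        self.entries[stage] = Entry(primary, set(places))

    def acknowledge(self, stage: int, place: str) -> None:
        e = self.entries[stage]
        e.acked.add(place)
        if e.all_aware_at is None and e.acked >= e.places:
            e.all_aware_at = self.clock()  # start the retention period

    def lookup(self, stage: int) -> Optional[str]:
        now = self.clock()
        expired = [s for s, e in self.entries.items()
                   if e.all_aware_at is not None
                   and now - e.all_aware_at >= self.retention]
        for s in expired:
            del self.entries[s]            # discard after the retention period
        e = self.entries.get(stage)
        return e.primary if e else None
```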